Abstract. Finding an appropriate weight for each attribute is one of the
main points in model generalization via clustering and classification
methods, since the weights give more or less importance to the geographic
information. The purpose of this study is to determine the weights of the
geometric, topologic and semantic attributes by implementing a statistical
method, the chi-squared test of independence, in order to calculate their
contributions to the overall goal of selection. Two different drainage
patterns in the USGS National Hydrography Datasets, dendritic and trellis
with rectangular, were weighted at 1:24,000 scale.

1. Introduction

Generalization is a process used for reducing the volume of data of a spatial
data set while preserving important structures (Sester 2008). Map
generalization operations can be grouped into two categories: (1) model
generalization, which is a filtering process to obtain a subset of a target
database for data analysis (Joao 1998), and (2) cartographic generalization,
which is the set of operations concerned with the optimal visualization of
the selected data (Mackaness 2007). Selection is often used interchangeably
with model generalization and database abstraction. Topfer's Radical Law
(Topfer and Pillewiser 1966) is the only quantitative rule in the selection
of features. It yields the number of features to be displayed, but does not
reveal which of the features should be chosen.
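
As a rough illustration of the point above, the basic form of the Radical Law
can be sketched in Python as follows. The function name and the count of 1,000
source features are hypothetical; only the square-root relation between the
scale denominators comes from the law itself, and its refinements with
additional constants are left out.

```python
import math


def radical_law(n_source: int, source_denominator: float,
                target_denominator: float) -> float:
    """Basic Radical Law: n_target = n_source * sqrt(M_source / M_target)."""
    return n_source * math.sqrt(source_denominator / target_denominator)


# Hypothetical example: 1,000 river features at 1:24,000, target scale 1:100,000.
n_target = radical_law(1000, 24_000, 100_000)
print(f"features to keep: about {n_target:.0f}")
```

The result is only a count; it gives no indication of which features to keep,
which is the gap the attribute weights are meant to address.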

In river network generalization, two questions should be answered: How
many branches should be selected? Which branches are important? Topfer's
Radical Law answers the first question, but the second question is not
easy to answer.



With reference to the United States Geological Survey (USGS) Draft Standards
for the 1:100,000 National Hydrography Dataset (NHD), the capture conditions
are:

• If stream/river is perennial and flows from lake/pond or spring/seep, 
then capture. 

• If stream/river is intermittent, and can be definitely located, and flows 
from lake/pond or spring/seep, then capture. 

• If stream/river is perennial and is greater than or equal to 0.63" along 
the longest axis, then capture. 

• If stream/river is intermittent, and can be definitely located, and is not 
in an arid region, and is greater than or equal to 0.63" along the longest 
axis, then capture. 

• If stream/river is intermittent, and can be definitely located, and is in 
an arid region, and greater than or equal to 1.2" along the longest axis, 
then capture. 
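
The capture conditions above can be read as a small decision rule. The sketch
below is one possible encoding, not the USGS implementation; the function and
parameter names are hypothetical, and the length is assumed to be measured in
inches on the map along the longest axis.

```python
def should_capture(perennial: bool, definitely_located: bool,
                   flows_from_waterbody: bool, arid_region: bool,
                   map_length_in: float) -> bool:
    """Apply the five capture conditions listed above to one stream/river.

    flows_from_waterbody: the stream flows from a lake/pond or spring/seep.
    map_length_in: length along the longest axis, in inches on the map.
    """
    if perennial:
        return flows_from_waterbody or map_length_in >= 0.63
    # Intermittent streams must be definitely locatable to be captured.
    if not definitely_located:
        return False
    if flows_from_waterbody:
        return True
    threshold_in = 1.2 if arid_region else 0.63
    return map_length_in >= threshold_in


def ground_length_km(map_length_in: float, scale_denominator: float) -> float:
    """Convert a map length in inches to a ground length in kilometres."""
    return map_length_in * 0.0254 * scale_denominator / 1000.0


print(should_capture(False, True, False, True, 0.9))
print(ground_length_km(0.63, 100_000))
```

At 1:100,000 scale, 0.63" on the map corresponds to roughly 1.6 km on the
ground.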

If the whole context of the river network is not considered together with
its geometric, topologic and semantic attributes, the abstraction will
dramatically destroy the original structure. "Which attributes contribute to
the selection of rivers?" and "Comparing the attributes to the original river
networks, which one is more important in model generalization?" These
questions are very important in the selection decision in terms of preserving
the original structure of the river network. Finding an appropriate weight
for each attribute is one of the main points in model generalization via
clustering and classification, giving more or less importance to the
geographic information.

In the generalization literature related to weighting, Bjorke (1997) showed
how a weight function can be used to control the spatial distribution of the
map symbols. Wolf (1988) suggested weighted network data for hydrographic
generalization. Jiang and Harrie (2004) empirically weighted the attributes
used in road network generalization by Self Organizing Maps. Zhou and Jones
(2005) introduced weighted effective area, a set of area-based metrics for
cartographic line generalization. Kulik et al. (2005) assigned weights
consisting of geometric and semantic components to each vertex of every line
for deleting points in line simplification. Podolskaya (2007) used weights in
the quality assessment of generalization. Gulgen and Gokgoz (2011) used the
weighted mean, calculated by using the inverse of the number of roads with
the same connectivity value, as the threshold value for determining the
important roads.

Lotfi and Fallahnejad (2010) categorized the various methods for finding
weights in the multi attribute decision making literature into two groups:
subjective and objective weights. Subjective weights are determined only
according to the preferences of the decision makers. The AHP method, the
weighted least squares method and the Delphi method belong to this category.
The objective methods determine weights by solving mathematical models
without any consideration of the decision maker's preferences, for example,
the entropy method, multiple objective programming, principal element
analysis, etc.

Comparing different datasets at different resolutions may be a useful way
to determine the contribution of the geometric, topologic and semantic
attributes to the overall goal of selection. To this end, in this study,
weights were determined by implementing the chi-square ($\chi^2$) test of
independence to find the association between attributes and selection, and
Cramer's V coefficient ($\varphi_c$) (Cramer 1946) to find the strength of
association. Each numerical weight is calculated by normalization, dividing
each $\varphi_c$ by the sum of the $\varphi_c$ values. Two different drainage
patterns in the USGS NHDs, dendritic and trellis with rectangular, were
weighted at 1:24,000 scale.

2. Weighting by Chi-Squared Test of Independence 

Pearson's (1900) $\chi^2$ test is a very popular non-parametric test used for
multinomial data. Hence, the values of a quantitative variable must be
transformed into the classes of a categorical variable (LeBlanc 2004).

Pearson's $\chi^2$ statistic is

$$T = \sum \frac{(X - E)^2}{E} \qquad (1)$$

where $E$ is the expected frequency corresponding to the observed frequency
$X$. If $T$ is greater than the critical value at significance level $\alpha$
with $k$ degrees of freedom, $H_0$ is rejected and $H_1$ is accepted. Figure 1
shows the right-tailed $\chi^2$ curve with the acceptance and rejection
regions at significance level $\alpha$.

Figure 1. Right-tailed $\chi^2$ curve with acceptance and rejection regions at
significance level $\alpha$.



The $\chi^2$ test of independence (also called contingency table analysis) is
used to evaluate whether or not there is an association between individuals
falling into specific classes of one categorical variable and their
membership in classes of a second categorical variable. Rejecting the null
hypothesis ($H_0$) of independence is equivalent to concluding that there is
an association (LeBlanc 2004).

In this study, 

• $H_0$: There is no association between an attribute and the selection
result.

• $H_1$: There is an association between an attribute and the selection
result.

In order to weight the attributes, the strength of the $\chi^2$ association
is calculated by $\varphi_c$ (equation 2) and normalized so that the sum of
the weights is equal to one.

$$\varphi_c = \sqrt{\frac{T}{n(m-1)}} \qquad (2)$$

where $n$ is the number of objects and $m$ is the smaller of the numbers of
classes of the two categorical variables compared.
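
Equations 1 and 2 can be sketched in plain Python for a contingency table of
an attribute against the selection result. The counts below are hypothetical,
not taken from the NHD data, and the expected frequencies are assumed to be
the usual row total times column total divided by $n$. The critical value
3.841 is the standard $\chi^2$ value for one degree of freedom at
$\alpha = 0.05$.

```python
import math


def chi_square_statistic(table):
    """Return (T, n) for an r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    t = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            t += (observed - expected) ** 2 / expected
    return t, n


def cramers_v(table):
    """Strength of association (equation 2) for the same table."""
    t, n = chi_square_statistic(table)
    m = min(len(table), len(table[0]))  # smaller number of classes
    return math.sqrt(t / (n * (m - 1)))


# Hypothetical counts. Rows: length class (>= 1.6 km, < 1.6 km).
# Columns: selection result (selected, eliminated).
table = [[40, 10],
         [20, 30]]

t, n = chi_square_statistic(table)
v = cramers_v(table)
CRITICAL_ALPHA_005_DF_1 = 3.841  # (2 - 1) * (2 - 1) = 1 degree of freedom
reject_h0 = t > CRITICAL_ALPHA_005_DF_1
print(t, v, reject_h0)
```

With these made-up counts, $T$ exceeds the critical value, so the attribute
would be treated as associated with selection and its $\varphi_c$ would enter
the normalization.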



3. Case Study 

In this study, we used four NHDs, which are vector geospatial data layers of
The National Map being developed by the USGS. They cover two different
drainage patterns at 1:24,000 and 1:100,000 scales. Pomme De Terre (PT) shows
a dendritic pattern, while South Branch Potomac (SBP) shows a trellis with
rectangular pattern. The properties of drainage patterns are described in
Howard (1967) and Debarry (2004). Sample drainage patterns are shown in
Figure 2.

Figure 2. (a) Dendritic, (b) trellis, and (c) rectangular pattern (Debarry 2004).

From the point of view of selection, it is important that a river is
processed as a whole and that segments are not eliminated separately, which
may disconnect the graph (Stanislawski and Savino 2011). Thus, the river
segments were combined using the stream level and river type data, which are
embedded in the datasets. The following seven geometric, topologic and
semantic attributes were considered as associated with the selection of the
river networks. The NHD database was enriched with some new attributes as
follows.

1. Geometric attributes: 

a) Length (ratio-scaled): The lengths of the rivers. 

b) Sinuosity (ratio-scaled): The ratio of the Euclidean distance between 
the end points to the river length. 

2. Topologic attributes: The topologic attributes of the river networks
were calculated by three popular centrality measures. The idea of
centrality is specifically concerned with communication in small groups
and a relationship between structural centrality and influence in group
processes (Freeman, 1978).

a) Degree centrality (ratio-scaled): Number of links that a node has.

b) Betweenness centrality (ratio-scaled): Frequency with which node $p_k$
lies on the shortest paths linking nodes $p_i$ and $p_j$.

c) Closeness (ratio-scaled): Inverse of the distance of each node to
every other node in the network.

3. Semantic attributes: 

a) Stream level: A stream level in NHD is a numeric code that identifies
a hierarchy of main paths of water flow through the network.
Level values are established for the purpose of computationally
traversing the drainage network through flow relations identified
between the reaches, which are the segments of a river network.

b) River type: There are two river types. Perennial rivers are those that 
flow continuously, whereas intermittent rivers appear to dry up 
when the flow has the potential of being totally absorbed by the bed 
and underlying material. Intermittent rivers may flow continuously 
during wet years ("0" and "1" values were assigned to intermittent 
and perennial rivers, respectively). 
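
The three centrality measures listed above can be sketched on a toy network.
The graph below is hypothetical, and how rivers and confluences map to nodes
and links is an assumption of this example. Closeness is taken here as the
inverse of the summed shortest-path distances, and betweenness as the
unnormalized count over unordered node pairs; other conventions exist. The
sketch assumes a connected, unweighted graph with at least two nodes.

```python
from collections import deque
from itertools import combinations


def bfs(graph, source):
    """Return (distances, shortest-path counts) from source by breadth-first search."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                paths[v] += paths[u]
    return dist, paths


def centralities(graph):
    """Degree, closeness and betweenness for every node of a connected graph."""
    info = {s: bfs(graph, s) for s in graph}
    degree = {v: len(graph[v]) for v in graph}
    closeness = {v: 1.0 / sum(info[v][0].values()) for v in graph}
    betweenness = {v: 0.0 for v in graph}
    for s, t in combinations(graph, 2):
        dist_s, paths_s = info[s]
        for v in graph:
            if v in (s, t):
                continue
            dist_v, paths_v = info[v]
            # v lies on a shortest s-t path when the distances add up exactly.
            if dist_s[v] + dist_v[t] == dist_s[t]:
                betweenness[v] += paths_s[v] * paths_v[t] / paths_s[t]
    return degree, closeness, betweenness


# Hypothetical tree-shaped network: A-B, B-C, B-D, D-E.
graph = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"],
         "D": ["B", "E"], "E": ["D"]}
degree, closeness, betweenness = centralities(graph)
print(degree, closeness, betweenness)
```

In this toy network node B has the most links and lies on the most shortest
paths, which is the kind of node a main river junction would correspond to.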

The Kolmogorov-Smirnov test for goodness-of-fit (Massey, 1951) was applied
to check the normality of the attributes. All attributes have skewed
distributions. Thus, a non-parametric test is needed for testing
independence, and the $\chi^2$ test of independence is well suited to this
case.

Because the test is used for multinomial data, the length attribute was
categorized into two classes with regard to the USGS Draft Standards for the
1:100,000 NHD: if a river is longer than or equal to 1.6 km, it is included
in the first class, otherwise in the second class. The other ratio-scaled
attributes were categorized based on their standard deviations (1σ interval).
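
The categorization step can be sketched as follows. The 1.6 km threshold is
from the text; the binning of the other attributes is an assumption of this
example, which uses classes one standard deviation wide measured from the
mean and the population standard deviation. The exact class boundaries used in
the study are not stated, and the sample values are made up.

```python
import math
import statistics

LENGTH_THRESHOLD_KM = 1.6


def length_class(length_km: float) -> int:
    """Class 1 for rivers of at least 1.6 km, class 2 otherwise."""
    return 1 if length_km >= LENGTH_THRESHOLD_KM else 2


def std_classes(values):
    """Assign each value the index of its one-standard-deviation-wide bin.

    Bin 0 starts at the mean; negative indices lie below the mean.
    Assumes the values are not all identical (non-zero standard deviation).
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [math.floor((v - mean) / std) for v in values]


lengths_km = [0.4, 1.6, 2.3]        # hypothetical river lengths
attribute = [1, 2, 3, 4, 10]        # hypothetical skewed attribute values
length_classes = [length_class(x) for x in lengths_km]
attribute_classes = std_classes(attribute)
print(length_classes, attribute_classes)
```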

In this study, to calculate the weight of each parameter in terms of its
contribution to the overall goal of selection, each attribute was compared
with the selection result, coded as a binary integer (0: eliminated;
1: selected), in the original data. If there is an association, the $H_0$
hypothesis (there is no association) is rejected and the difference between
observed and expected frequencies is significant. $\varphi_c$ provides the
strength of the association (equation 2). The weights were obtained by
normalizing so that their sum is equal to one (dividing each $\varphi_c$ by
the sum of the $\varphi_c$ values).
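
The normalization described above can be sketched as below. The strength
values are hypothetical, and marking an attribute whose $H_0$ was not rejected
with `None` and leaving it out of the sum is an assumption of this example.

```python
def normalize_weights(strengths):
    """Divide each strength of association by the sum of all strengths.

    strengths maps attribute name -> strength of association, or None when
    no association was found; such attributes receive no weight.
    """
    total = sum(v for v in strengths.values() if v is not None)
    return {name: (v / total if v is not None else None)
            for name, v in strengths.items()}


# Hypothetical strengths of association for four attributes.
strengths = {"length": 0.40, "sinuosity": 0.20, "closeness": None, "type": 0.20}
weights = normalize_weights(strengths)
print(weights)
```

The weights of the associated attributes sum to one, so they can be read
directly as percentages.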

4. Results 

The determined weights of the attributes are given in Table 1 and Figure 3.
They are determined with respect to three significance levels ($\alpha$: 0.05,
0.01 and 0.001).



Subbasin  Length  Sinuosity  Degree  Between  Closeness       Str. Level  Type
PT        30%     14%        19%     21%      No association  11%         5%
SBP       21%     7%         18%     12%      8%              14%         20%

Table 1. The determined weights of the attributes for PT and SBP.






Figure 3. The graphic of the determined weights of the attributes for PT and SBP. 

In Figure 4a and 5a, the drainage networks of PT and SBP at 1:100,000 scale
are shown at 1:1,000,000 and 1:1,500,000 scale, respectively. In the
close-ups (i.e. Figure 4b-h and 5b-h), the drainage networks at 1:100,000 and
1:24,000 scale are shown together. The branches with buffers are those of the
drainage networks at 1:100,000 scale. Moreover, each branch is colored with
respect to the classes of the attributes, as given in the legends.
Furthermore, the black lines in the close-ups are the centerlines. The
results for each attribute can be summarized as follows.

• The weights indicate that length is the most important attribute
associated with selection. As shown in Figure 4b and 5b, the rivers in red
(>1.6 km) are mostly selected for 1:100,000 scale. However, the rivers in
yellow (<1.6 km) are also needed for the sake of drainage continuity.

• Generally, very sinuous rivers are selected for 1:100,000 scale (red and
yellow in Figure 4e and 5h). However, sinuosity does not guarantee
continuity either.

• The degree centrality gives the main rivers, which have more links
(Figure 4d and 5d). However, it could not provide continuity. The
weights of degree centrality are very close for both river networks.

• The betweenness centrality could give both main rivers and linkages 
between centerlines by small river segments. It is more effective in PT 
(Figure 4c and 5f). 

• The closeness centrality of PT is not associated with the selection result
because the $\chi^2$ statistic of closeness is smaller than the critical
value. As a result, in Figure 4h, almost all rivers are in green, i.e. there
is only one class. In Figure 5g, by contrast, almost all rivers are not in
green, i.e. there is more than one class, but they reflect a weak
association. This attribute is therefore not very useful for selection.

• The stream level could show the main rivers and provide linkage between
centerlines with high hierarchy values. However, it is not very powerful
for middle levels (Figure 4f and 5e).

• The weight of river type is higher in SBP, because the perennial rivers
are homogeneously distributed, and the connections between the main rivers
and the intermittent rivers in the trellis with rectangular pattern of
SBP are perennial rivers. On the other hand, the perennial rivers are
heterogeneously distributed in the dendritic pattern of PT (Figure 4g
and 5c).

5. Conclusions 

Weighting of the attributes is useful in model generalization via clustering
or classification, and the proposed approach could be used for this aim. Note
that the weights should not be taken as general; they are specific to PT and
SBP. Generic weights may be determined from weights calculated in more
characteristic subbasins of the conterminous US covering all patterns.